Introduction

The goal of this file is to document what has been done so far in my internship in computational methods (Stage en méthodes computationnelles) under the supervision of Louis Renaud-Desjardins. It mainly explains the methodological choices behind our work and presents the main results obtained so far. How we fetch the various data is explained in another file, C:/Users/jacob/OneDrive - Université Laval/biophilo/fetch_data_2024-11-18.Rmd.

Special thanks to François Claveau, Pierre-Olivier Méthot, Louis Renaud-Desjardins, Thomas Pradeu and Maël Lemoine for their support on this project.

For any questions or issues, please write at .

# Helper functions: add a percentage column and display an interactive data table
fct_percent <- function(x) {
  # Share of each row's count `n` in the overall total, in percent (3 decimals)
  x |> mutate(percent = round(n / sum(x$n, na.rm = TRUE) * 100, 3))
}

fct_DT <- function(x) {
  # Show at most the first 1000 rows, 5 per page, with horizontal scrolling
  DT::datatable(head(x, 1000),
                options = list(scrollX = TRUE,
                               paging = TRUE,
                               pageLength = 5))
}

Database choice

Early in our work came a methodological choice: which of two well-known databases to use, Web of Science (WoS) or Scopus. We chose Scopus, for two main reasons:

  1. Scopus has wider coverage of the journals we want to investigate, namely the main journals in philosophy of biology. For example, it covers Biological Theory, which is absent from our Web of Science database.

  2. For philosophy, Web of Science is missing many links between citing and cited documents.

If you want in-depth details about both databases and their strengths and weaknesses given our corpus, read on. If you only want the results, you can skip straight to the results section.

Coverage across different databases: Springer, Web of Science and Scopus

Here is how we fetched the information from the different databases.

  • For Springer, we went through each volume manually and created an Excel sheet with the number of articles per year.

  • For Web of Science, we fetched the data through the Albator database, which we accessed via the OST. The details of the SQL query were provided earlier.

  • For Scopus, we used the Scopus API and the rscopus package to get our data. See the code below for the specific workflow.

To compare the coverage of each database (Web of Science, Scopus and Springer), we base our tests on the well-known journal Biology & Philosophy, which is present in all three.

bp_db <- read_csv(paste0(dir_od,"bp_article_db.csv"), skip = 1) # This is the data from B&P Springer and not WoS.
bio_philo_papers <- read_csv(paste0(dir_od, "bio_philo_papers.csv"))
bio_philo_affiliations <- read_csv(paste0(dir_od, "bio_philo_affiliations.csv"))
bio_philo_authors <- read_csv(paste0(dir_od, "bio_philo_authors.csv"))
citing_articles <- bio_philo_papers$`dc:identifier` # extracting the IDs of our articles
bio_th_db <- read_csv(paste0(dir_od, "biological_theory_bp_2024-10-7.csv")) # This is the data from Bio. Th. Springer and not Scopus.

Articles

Let’s start by making sure that we have good coverage for articles. Unsurprisingly, Springer lists more articles than the other databases. At first sight, Web of Science seems to have better coverage than Scopus, with a difference of more than 200 articles.

fct_DT(
  art_all |>
  group_by(FROM) |>
  summarise(total_N = sum(N)) |>
  arrange(desc(total_N))
)
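The construction of art_all is not shown in this file. As a minimal sketch, assuming it is a long table with one row per database and year (columns FROM, year and N), it could be assembled and summarised as follows. The object names ending in _toy and all the counts are made up for illustration.

```r
library(dplyr)
library(tibble)

# Toy yearly article counts per source (made-up numbers, for illustration only)
springer_toy <- tibble(year = c(2020, 2021), N = c(50, 55), FROM = "Springer")
wos_toy      <- tibble(year = c(2020, 2021), N = c(48, 50), FROM = "Web of Science")
scopus_toy   <- tibble(year = c(2020, 2021), N = c(45, 52), FROM = "Scopus")

# Stack the three sources into one long table (assumed shape of art_all)
art_all_toy <- bind_rows(springer_toy, wos_toy, scopus_toy)

# Total number of articles per database, largest first
totals_toy <- art_all_toy |>
  group_by(FROM) |>
  summarise(total_N = sum(N)) |>
  arrange(desc(total_N))

totals_toy
```

Keeping the data in long format means the same table can feed both the totals and a per-year comparison.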

Let’s look at the distribution of those articles since the beginning of Biology & Philosophy. Web of Science has good coverage except for the early and most recent years of B&P, whereas Scopus does well in those same years. Scopus, however, seems to lose many articles in the decade 1995-2005 (not shown in the histogram for visibility). Yet when we keep only documents of type article, Scopus actually has better coverage overall.
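To illustrate what filtering only on articles means, here is a small sketch. It assumes one row per document and a hypothetical doc_type column holding the document type; the actual column name in our data may differ, and the rows are made up.

```r
library(dplyr)
library(tibble)

# Toy records (made up): one row per document, with a hypothetical doc_type column
docs_toy <- tibble(
  FROM     = c("Scopus", "Scopus", "Scopus",
               "Web of Science", "Web of Science", "Web of Science"),
  year     = c(1990, 1990, 1991, 1990, 1991, 1991),
  doc_type = c("Article", "Review", "Article",
               "Article", "Book Review", "Book Review")
)

# Counts per database before filtering (all document types)
all_docs_toy <- docs_toy |> count(FROM, name = "N_all")

# Counts per database when keeping only documents of type "Article"
articles_only_toy <- docs_toy |>
  filter(doc_type == "Article") |>
  count(FROM, name = "N_articles")

left_join(all_docs_toy, articles_only_toy, by = "FROM")
```

In this toy case both databases hold three documents, but the ranking changes once reviews and book reviews are excluded, which is the kind of reversal described above.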

References

Now that we have a better idea of the articles we can get from both databases, let’s look at their references.

For Web of Science, we found that many references had no unique identifier (around 1/3, and potentially up to 1/2).

Here is some quantitative data:

  1. We have a total of 67 774 cited documents for our B&P corpus.
  2. 43 828 cited documents have a Cited_Title (~ 24 000 lost already).
  3. 31 439 cited documents have a Cited_Title and/or an OST_BK_Ref (the OST_BK, or unique identifier, of a cited document; see the “Cited authors (references’ authors)” section of the SQL script above).
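The arithmetic behind these three figures can be checked with a few lines of R. The variable names are ours, chosen for illustration; the numbers are the ones listed above.

```r
# Figures taken from the list above
total_cited   <- 67774  # all cited documents in the B&P corpus (WoS)
with_title    <- 43828  # cited documents with a Cited_Title
with_title_id <- 31439  # cited documents with a Cited_Title and/or an OST_BK_Ref

# Documents lost because they have no Cited_Title
lost_without_title <- total_cited - with_title

# Shares of the full corpus, in percent, rounded to one decimal
share_with_title <- round(with_title / total_cited * 100, 1)
share_with_id    <- round(with_title_id / total_cited * 100, 1)

c(lost = lost_without_title,
  pct_with_title = share_with_title,
  pct_with_id = share_with_id)
```

In other words, 23 946 cited documents have no title, roughly 64.7% of cited documents have a title, and roughly 46.4% remain in the last group.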

This is problematic, since this corpus of 31 439 cited documents does not contain the books and book chapters that we are interested in. Compared with this important limitation, Scopus does much better.

We must thank Aurélien Goutsmedt, who made us aware of the API provided by Scopus. It eased our work substantially (for more information, see his blog).

With Scopus, we get more than 68 800 references, with close to 100% match between references and articles. As a reminder, we had close to 68 000 references (67 774) when going through WoS, and almost 1/3 of them had no link to their respective article (in the histogram, see the Cited_ID comparison).
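One way to quantify the match between references and articles is to check how many reference rows point to an article present in the article table. The sketch below assumes both tables share a citing_art key; the tables and identifiers are made up.

```r
library(dplyr)
library(tibble)

# Toy article and reference tables (made-up identifiers)
articles_toy <- tibble(citing_art = c("A1", "A2", "A3"))
references_toy <- tibble(
  citing_art = c("A1", "A1", "A2", "A9"),  # "A9" has no matching article
  scopus_id  = c("R1", "R2", "R1", "R3")
)

# References whose citing article exists in the article table
matched_toy <- semi_join(references_toy, articles_toy, by = "citing_art")
match_rate  <- nrow(matched_toy) / nrow(references_toy) * 100

# References left without a citing article, for manual inspection
unmatched_toy <- anti_join(references_toy, articles_toy, by = "citing_art")

match_rate
unmatched_toy
```

semi_join() keeps the reference rows unchanged (no columns are added), so the row counts can be compared directly.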

# Completeness of data -----------------------------------------------------
df <- tibble(ref_bp) |> mutate(across(where(is.character), ~ na_if(., "NULL")))

df <-  df |> rename(Citing_ID = OST_BK, Cited_ID = OST_BK_Ref, Cited_Year = Year)

# Create a tibble summarizing total rows, NA values, and non-NA values by column
summary_tibble <- tibble(
  column = names(df),
  total_rows = nrow(df),
  na_count = sapply(df, function(x)
    sum(is.na(x))),
  non_na_count = sapply(df, function(x)
    sum(!is.na(x)))
)

# Share of non-NA values per column, in percent (3 decimals), most complete first
summary_tibble <- summary_tibble |>
  mutate(percent = round(non_na_count / total_rows * 100, 3)) |>
  arrange(desc(percent))

summary_tibble_WoS <- summary_tibble |> 
  filter(column != "UID" & column != "UID_Ref")

summary_tibble_WoS$column <- factor(summary_tibble_WoS$column, levels = c("Citing_ID", "Cited_ID","Cited_Author", "Cited_Year", "Cited_Work", "Cited_Title"))
summary_tibble_WoS <-summary_tibble_WoS |> 
  mutate(column_adjust = column) |> # this will simplify our work next
  mutate(FROM = "Web of Science") 

Scopus : Biological Theory

Another reason why it can be interesting to choose Scopus is that it covers more journals in philosophy of biology such as Biological Theory (BT).

Coverage across different databases

Articles

Let’s look at the articles we are able to fetch with the Scopus API compared with the articles listed on Springer for the journal. We created a .csv by manually counting all the articles listed on Springer for BT and compared it with what we got from the API.

The first step is to compare the coverage between Springer and Scopus. As we see, both are close, with Springer listing slightly fewer than 60 articles more than Scopus.

#Table

fct_DT(
  art_all_BT |>
  group_by(FROM) |>
  summarise(total_N = sum(N)) |>
  arrange(desc(total_N))
)

When we look at each year individually, the coverage is good. However, 2024 is a strange year in which Scopus lists more articles than Springer. We do not understand why at the moment.
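Years such as 2024 can be flagged automatically by putting the two sources side by side. This is a sketch on made-up counts, assuming the same long format as above (columns FROM, year and N).

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy yearly counts for Biological Theory (made-up numbers)
art_bt_toy <- tibble(
  FROM = rep(c("Springer", "Scopus"), each = 3),
  year = rep(c(2022, 2023, 2024), times = 2),
  N    = c(40, 42, 30, 38, 41, 33)
)

# One column per source, then the yearly difference and a flag
coverage_by_year <- art_bt_toy |>
  pivot_wider(names_from = FROM, values_from = N) |>
  mutate(diff = Scopus - Springer,
         scopus_exceeds = diff > 0)

coverage_by_year
```

A positive diff marks a year where the API returns more documents than the publisher's site lists, which is worth checking by hand.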

References

As we see, the coverage in Scopus resembles the one in Springer when it comes to articles.

Now, let’s look at the references. For the journal Biological Theory, we get 35 793 references in total.

We have almost 100% non-NA entries for a) Cited_Authors, b) Citing_ID, c) Cited_ID and d) Cited_Work, which is either the article name or the book name. We should not be too bothered by the fact that the Cited_Title column has around 40% of NA entries, since books typically do not have one.
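The share of NA entries per column can be computed compactly with across(). This is a sketch on a made-up table that reuses the column names mentioned above, not the code that produced our figures.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy reference table (made-up values)
refs_na_toy <- tibble(
  Cited_ID    = c("R1", "R2", "R3", "R4"),
  Cited_Title = c("T1", NA, NA, "T4"),
  Cited_Year  = c(1976, NA, 2003, 2003)
)

# Percentage of NA per column, reshaped to one row per column
na_share_toy <- refs_na_toy |>
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) |>
  pivot_longer(everything(), names_to = "column", values_to = "percent_na") |>
  arrange(desc(percent_na))

na_share_toy
```

mean(is.na(.x)) works because TRUE and FALSE are treated as 1 and 0, so the mean is the proportion of missing values.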

Something that can look more bothersome is that the Cited_Year column has 40% of NA values. Looking into it, we can easily understand why there are so many NAs. The main reason is that Scopus has cleaned the data beforehand, notably for books that have many editions. You can check this by fetching the data directly from the Scopus website and comparing it with what we get from the API.

Let’s look at an example. The famous book by Richard Dawkins, The Selfish Gene, has been referred to with different publication years (i.e. 1976 and 1989). The same holds for the seminal work by Odling-Smee et al., Niche Construction: The Neglected Process in Evolution, which gets cited with two different years. If we look at the data from the Scopus API, we see that Dawkins’s book has no year attributed to it and that many similar but distinct entries are grouped under the same unique identifier (scopus_id).
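Such cases can be listed systematically by counting the distinct years attached to each identifier. The sketch below uses made-up rows; the identifiers id_1 and id_2 and the column names cited_work and raw_year are hypothetical, and only the two years for The Selfish Gene come from the example above.

```r
library(dplyr)
library(tibble)

# Toy raw references (hypothetical identifiers and column names)
refs_raw_toy <- tibble(
  scopus_id  = c("id_1", "id_1", "id_1", "id_2"),
  cited_work = c("THE SELFISH GENE", "THE SELFISH GENE", "SELFISH GENE", "SOME ARTICLE"),
  raw_year   = c(1976, 1989, 1976, 2010)
)

# Identifiers that group entries with more than one publication year
multi_year_ids <- refs_raw_toy |>
  group_by(scopus_id) |>
  summarise(n_years = n_distinct(raw_year),
            years   = paste(sort(unique(raw_year)), collapse = ", ")) |>
  filter(n_years > 1)

multi_year_ids
```

Identifiers returned by this check are the ones for which a single year cannot be assigned without an explicit rule, for example keeping the earliest edition.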

# REGEX FOR VARIOUS REFERENCES' EXTRACTION --------------------------------
# Define extraction patterns
extract_authors <- paste0(
  "^",                                         
  "(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*",  # Matches first part of author name allowing hyphens and apostrophes
  "(?:[A-Z]+(?:[-'][A-Z]+)*)",      # Matches last part of author name allowing hyphens and apostrophes
  "\\s+[A-Z](?:\\.[A-Z])*\\.",      # Matches initials (e.g., J. or J.A.)
  "(?:,\\s+",                                  
  "(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*",  # Matches first part of additional author names
  "(?:[A-Z]+(?:[-'][A-Z]+)*)",      # Matches last part of additional author names
  "\\s+[A-Z](?:\\.[A-Z])*\\.",      # Matches initials of additional authors
  ")*"                                         
)

extract_year <- "\\b(\\d{4})\\b"
extract_journal <- "[A-Z][A-Za-z\\s]+(?=\\,\\s\\d)" 
extract_volume <- "\\b\\d+\\b(?=\\,|\\s)" 
extract_issue <- "(?<=\\,\\s)(?:[A-Z])?\\d+(?=\\,|\\s|\\()|(?<=\\,\\s)[A-Z]\\d+(?=\\,|\\s|\\()"
extract_pages <- "\\bP{0,1}\\.\\s*\\d+(-\\d+)?\\b"

references_extract$references <- toupper(references_extract$references)


extraction <- function(ref) {
  # Extract components
  year <- str_extract(ref, extract_year)
  authors <- str_extract(ref, extract_authors)
  journal <- str_extract(ref, extract_journal)
  pages <- str_extract(ref, extract_pages)
  
  # Extract volume and issue separately
  volume_issue <- str_extract(ref, "\\b\\d{1,4}\\b(,\\s*\\d{1,4})?")
  
  # Split into volume and issue if both are present
  if (!is.na(volume_issue)) {
    volume_issue_split <- str_split(volume_issue, ",\\s*")[[1]]
    volume <- volume_issue_split[1]  # First part is the volume
    issue <- ifelse(length(volume_issue_split) > 1, volume_issue_split[2], NA)  # Second part is the issue, if it exists
  } else {
    volume <- NA
    issue <- NA
  }
  
  # Clean up formats
  year <- str_trim(year)
  pages <- ifelse(!is.na(pages), str_extract(pages, "\\d+(-\\d+)?"), NA)
  
  # Create a vector of extracted components
  extracted_parts <- c(authors, year, journal, volume, issue, pages)
  
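  # Remove extracted parts and clean the remaining reference.
  # NA parts are dropped first (pasting them would remove the literal string "NA"),
  # and each part is matched as fixed text rather than as a regular expression.
  parts_found <- extracted_parts[!is.na(extracted_parts) & nzchar(extracted_parts)]
  remaining_ref <- ref
  for (part in parts_found) {
    remaining_ref <- str_remove_all(remaining_ref, fixed(part))
  }
  remaining_ref <- remaining_ref %>%
    str_remove_all(",\\s*") %>%
    str_remove_all("\\s*\\(.*?\\)\\s*") %>%
    str_remove_all("P\\.\\s*|PP\\.\\s*") %>%
    str_trim()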
  
  tibble(
    extracted_year = year,
    extracted_authors = authors,
    unique_author = authors, # This extra column is to get all unique author for later count. 
    extracted_journal = journal,
    extracted_volume = volume,
    extracted_issue = issue,
    extracted_pages = pages,
    remaining_ref = remaining_ref 
  )
}


# Apply the function and handle nested results
results <- references_extract %>%
  mutate(
    extraction_results = map(references, extraction)  # Apply function to each reference
  ) %>%
  unnest_wider(extraction_results)  # Unnest the tibble returned by `extraction`

#write_csv(results2, paste0(dir, "results2.csv"))
# View the results
fct_DT(results)

results_split <- results %>%
  separate_rows(unique_author, sep = ",\\s*")  # Split authors into multiple rows
fct_DT(results_split)


# SAVE RESULTS ------------------------------------------------------------
write_csv(results, paste0(dir_od, "cleaned_ref.csv"))
write_csv(results_split, paste0(dir_od, "cleaned_ref_split.csv"))
count_ref_art <- results |> 
  filter(!is.na(extracted_authors)) |>
  select(extracted_authors, extracted_year, remaining_ref) |>
  add_count(remaining_ref, extracted_authors) |> 
  unique() |>
  arrange(desc(n))

fct_DT(count_ref_art)
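To make the extraction patterns easier to check, here is a small sketch that applies the year and pages patterns to a single reference string. The two patterns are copied from the code above; the reference string, including its page range, is made up and already uppercased, as in the pipeline.

```r
library(stringr)

# Patterns copied from the extraction code above
extract_year  <- "\\b(\\d{4})\\b"
extract_pages <- "\\bP{0,1}\\.\\s*\\d+(-\\d+)?\\b"

# Made-up reference string, for illustration only
ref_toy <- "DAWKINS R., THE SELFISH GENE, PP. 12-34, (1976)"

# First four-digit number delimited by word boundaries
year_toy <- str_extract(ref_toy, extract_year)

# Raw page match, then only the digits and the optional range
pages_raw_toy <- str_extract(ref_toy, extract_pages)
pages_toy     <- str_extract(pages_raw_toy, "\\d+(-\\d+)?")

c(year = year_toy, pages = pages_toy)
```

Note that the year pattern returns the first four-digit number it finds, so a reference whose title or page range contains such a number before the publication year would need an extra rule.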

Comparative results of Biology & Philosophy and Biological Theory

Now that we have checked that the coverage for both articles and their references is satisfactory, we can dig into some of the results. Let’s start with the articles.

Articles’ results

Authors that published the most in Biology & Philosophy

author_count_bp <- bio_philo_authors |> select(authname) |> count(authname) |> arrange(desc(n))
author_count_bp <- fct_percent(author_count_bp) 
fct_DT(author_count_bp)

Authors that published the most in Biological Theory

author_count_th <- bio_th_authors |> select(authname) |> count(authname) |> arrange(desc(n))
author_count_th <- fct_percent(author_count_th) 
fct_DT(author_count_th)

Comparison of the authors that published the most in B&P vs BT

author_count_th <- author_count_th |> mutate(rank_in_th = dense_rank(-author_count_th$n))
author_count_bp <- author_count_bp |> mutate(rank_in_bp = dense_rank(-author_count_bp$n))

author_count_tbl <- full_join(author_count_th |> select(authname, rank_in_th), author_count_bp |> select(authname, rank_in_bp), by = "authname")
fct_DT(author_count_tbl)
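The behaviour of dense_rank() combined with full_join() is easier to see on a tiny example. The author names and counts below are placeholders, not results.

```r
library(dplyr)
library(tibble)

# Toy publication counts per author in each journal (placeholder names)
count_bp_toy <- tibble(authname = c("Author A", "Author B", "Author C"), n = c(5, 3, 3))
count_th_toy <- tibble(authname = c("Author B", "Author D"), n = c(4, 1))

# dense_rank() gives tied counts the same rank and leaves no gaps
rank_bp_toy <- count_bp_toy |> mutate(rank_in_bp = dense_rank(desc(n)))
rank_th_toy <- count_th_toy |> mutate(rank_in_th = dense_rank(desc(n)))

# full_join() keeps authors present in only one journal, with NA for the other rank
rank_tbl_toy <- full_join(
  rank_th_toy |> select(authname, rank_in_th),
  rank_bp_toy |> select(authname, rank_in_bp),
  by = "authname"
)

rank_tbl_toy
```

An NA in one of the rank columns therefore means that the author did not publish in that journal, not that the rank is unknown.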

References’ results

Most cited references in Biology & Philosophy

clean_references_bp <- clean_references_bp |> mutate(year = as.Date(year)) |> mutate(year = year(year)) #To get only the year. 

most_c_ref_bp <- clean_references_bp  |> filter(!is.na(sourcetitle)) |>  
  select(scopus_id, author, year, sourcetitle, title, type) |> add_count(sourcetitle, title, author, scopus_id)  |> arrange(desc(n)) |> distinct() 

most_c_ref_bp <- fct_percent(most_c_ref_bp)
fct_DT(most_c_ref_bp |> select(-scopus_id, -type))

Most cited references in Biological Theory

clean_references_th <- clean_references_th |> mutate(year = as.Date(year)) |> mutate(year = year(year)) 

most_c_ref <- clean_references_th  |> filter(!is.na(sourcetitle)) |>  select(scopus_id, author, year, sourcetitle, title, type) |> add_count(sourcetitle, title, author, scopus_id)  |> arrange(desc(n)) |> distinct() 
most_c_ref_th <- fct_percent(most_c_ref)
fct_DT(most_c_ref_th |> select(-scopus_id, -type))

Most cited references in both corpora

# BIOLOGY & PHILOSOPHY
clean_references_bp <- clean_references_bp |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")

clean_references_bp <- clean_references_bp |>
  distinct() |> 
  filter(!is.na(scopus_id))

clean_references_bp <- clean_references_bp |> 
  group_by(scopus_id) |> 
  filter(most_n == max(most_n)) |> 
  slice_head(n = 1) |> 
  ungroup() 

rank_bp <- clean_references_bp |> mutate(rank_in_bp = dense_rank(-clean_references_bp$n)) |> arrange(desc(n))


# BIOLOGICAL THEORY
clean_references_th <- clean_references_th |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")

clean_references_th <- clean_references_th |>
  distinct() |> 
  filter(!is.na(scopus_id))

clean_references_th <- clean_references_th |> 
  group_by(scopus_id) |> 
  filter(most_n == max(most_n)) |> 
  slice_head(n = 1) |> 
  ungroup() 

rank_th <- clean_references_th |> mutate(rank_in_th = dense_rank(-clean_references_th$n)) |> arrange(desc(n))


cited_authors_tbl <- full_join(
  rank_th |> select(scopus_id, author, year, sourcetitle, title, rank_in_th),
  rank_bp |> select(scopus_id, rank_in_bp),
  by = "scopus_id") 

fct_DT(cited_authors_tbl |> select(-scopus_id))

Most cited authors in Biology & Philosophy and Biological Theory

# BIOLOGY & PHILOSOPHY
count_cited_authors <- clean_references_bp |>
  select(author, author_list_author_ce_initials) |>
  count(author, author_list_author_ce_initials) |>
  arrange(desc(n))
count_cited_authors <- fct_percent(count_cited_authors)
fct_DT(count_cited_authors)

# BIOLOGICAL THEORY
count_cited_authors <- clean_references_th |>
  select(author, author_list_author_ce_initials) |>
  count(author, author_list_author_ce_initials) |>
  arrange(desc(n))
count_cited_authors <- fct_percent(count_cited_authors)
fct_DT(count_cited_authors)

Journals’ results

Most cited journals in Biology & Philosophy

cited_journals_bp <- clean_references_bp |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle)  |> arrange(desc(n))

cited_journals_bp <- fct_percent(cited_journals_bp)
fct_DT(cited_journals_bp)
write_csv(cited_journals_bp, paste0(dir_od, "cited_journals_bp_2024-12-07.csv"))

Most cited journals in Biological Theory

cited_journals_th <- clean_references_th |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle)  |> arrange(desc(n))

cited_journals_th <- fct_percent(cited_journals_th)
fct_DT(cited_journals_th)

Comparison of cited journals

journal_rank_bp <- cited_journals_bp |> mutate(rank_in_bp = dense_rank(-cited_journals_bp$n)) |> arrange(desc(n))
journal_rank_th <- cited_journals_th |> mutate(rank_in_th = dense_rank(-cited_journals_th$n)) |> arrange(desc(n))

journal_rank_all <- full_join(journal_rank_bp |> select(sourcetitle, rank_in_bp),journal_rank_th |> select(sourcetitle, rank_in_th), by = "sourcetitle")

fct_DT(journal_rank_all)

Keywords

Biology & Philosophy

keyword_bp <- bio_philo_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)

keyword_bp_cleaned <- keyword_bp |> 
  separate_rows(authkeywords, sep = " \\| ") |> 
  filter(!is.na(authkeywords)) 

keyword_bp_cleaned$authkeywords <- toupper(keyword_bp_cleaned$authkeywords)

# AUTHOR KEYWORD COUNTS (used for the word cloud)
keyword_bp_count <-  keyword_bp_cleaned |> 
 select(authkeywords) |> count(authkeywords, sort=TRUE)

keyword_bp_count <- fct_percent(keyword_bp_count)
fct_DT(keyword_bp_count)

Biological Theory

keyword_th <- bio_th_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)

keyword_th_cleaned <- keyword_th |> 
  separate_rows(authkeywords, sep = " \\| ") |> 
  filter(!is.na(authkeywords))

keyword_th_cleaned$authkeywords <- toupper(keyword_th_cleaned$authkeywords)

# AUTHOR KEYWORD COUNTS (used for the word cloud)
keyword_th_count <- keyword_th_cleaned |>
  select(authkeywords) |> count(authkeywords, sort = TRUE)

keyword_th_count <- fct_percent(keyword_th_count)
fct_DT(keyword_th_count)

Word clouds

Now that we have these tables, it might be of use to visualize those keywords and their importance.

Biology & Philosophy

Biological Theory

An important thing to note is that we don’t have access to the keywords of the references provided by Scopus.

Citation delay

Here, we compute what we call citation delay: the difference between an article’s publication year and the mean year of the references it cites. The cumulative distribution function below shows how this citation delay evolves as the journal gets older.
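The definition above can be sketched per article as follows. The tables are made up and only reuse the column names citing_art, year and cited_year. Note that the code in this document computes one delay per reference (publication year minus cited year), whereas this sketch aggregates to one value per article, as in the definition.

```r
library(dplyr)
library(tibble)

# Toy articles and their references (made-up values)
papers_toy <- tibble(citing_art = c("A1", "A2"), year = c(2010, 2020))
refs_delay_toy <- tibble(
  citing_art = c("A1", "A1", "A2", "A2", "A2"),
  cited_year = c(2000, 2006, 2015, NA, 2019)  # NA: reference without a year
)

# Mean cited year per article (ignoring NA), then publication year minus that mean
delay_per_article <- refs_delay_toy |>
  left_join(papers_toy, by = "citing_art") |>
  group_by(citing_art, year) |>
  summarise(mean_cited_year = mean(cited_year, na.rm = TRUE), .groups = "drop") |>
  mutate(delay = year - mean_cited_year)

delay_per_article
```

Here article A1 (2010) cites works from 2000 and 2006, so its mean cited year is 2003 and its delay is 7 years. References without a year are simply ignored, which matters given the share of missing years discussed earlier.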

# BIOLOGY & PHILOSOPHY
bio_philo_papers <- bio_philo_papers |> mutate(date = as.Date(prism_cover_date)) |> 
  mutate(year = year(date))

delay_refs_bp <- clean_references_bp |> 
  rename(cited_year = year) |>
  left_join(bio_philo_papers |> select(citing_art, year), 
            by = "citing_art") |>
  arrange(desc(citing_art))



delay_refs_bp <- delay_refs_bp |> mutate(delay = year-cited_year) |> mutate(from = "B&P")


delay_refs_bp$decade <- cut(delay_refs_bp$year, 
                            breaks = c(2006, 2014, 2023),  # bins [2006, 2014) and [2014, 2023), i.e. up to 2022
                            labels = c("2006-2013", "2014-2022"),
                            right = FALSE)  # Left-inclusive


# Articles published before 2006 get decade = NA and are filtered out of the plot
p1 <- ggplot(delay_refs_bp |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
  stat_ecdf(geom = "step", show.legend = FALSE) +
  labs(title = "Citation Delay Biology & Philosophy (2006-2022)",
       x = "Delay",
       y = "CDF") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(limits = c(0, 50)) 

# BIOLOGICAL THEORY
delay_refs_th <- clean_references_th |> 
  rename(cited_year = year) |>
  left_join(bio_th_papers |> 
  select(citing_art, year), by = "citing_art") |>
  arrange(desc(citing_art))


delay_refs_th <- delay_refs_th |> mutate(delay = year-cited_year) |> mutate(from = "BT")


delay_refs_th$decade <- cut(delay_refs_th$year, 
                    breaks = c(2006, 2014, 2023),  # same bins as for B&P
                    labels = c("2006-2013", "2014-2022"),
                    right = FALSE)  # left-inclusive


p2 <- ggplot(delay_refs_th |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
  stat_ecdf(geom = "step", show.legend = FALSE) +
  labs(title = "Citation Delay Biological Theory (2006-2022)",
       x = "Delay",
       y = "CDF") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_x_continuous(limits = c(0, 50)) 
# BOTH B&P AND BT
all <- rbind(delay_refs_bp, delay_refs_th) |> filter(!is.na(decade))


p3 <- ggplot(all, aes(x = delay, color = decade, group = decade)) +
  stat_ecdf(geom = "step") +
  labs(x = "Delay",
       y = "CDF") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "top", legend.title = element_blank()) +
  scale_x_continuous(limits = c(0, 50)) +
  facet_grid(cols = vars(from))  # one panel per journal, side by side

ggplotly(p3) |> layout(legend = list(title = FALSE, orientation = "h",
                     xanchor = "center",  
                     x = 0.5, y = 1.2))

What is the distribution of the citations?

Although this shift is interesting, we need to be careful. It could simply be that, as time goes by, authors keep citing the same old works, which would create an artificial, and not really problematic, shift to the right. Let’s look at the distribution of citation delay.

p4 <- all |> ggplot(aes(x = delay, group = from, fill = decade, color = decade)) +
  geom_density(alpha = 0.5) +
  facet_grid(cols = vars(from), rows = vars(decade)) 

ggplotly(p4) |> layout(legend = list(title = FALSE, orientation = "h",   # show entries horizontally
                     xanchor = "center",  # use center of legend as anchor
                     x = 0.5, y = 1.2))
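As a numerical complement to the density plots, the distributions can also be summarised per journal and period. This is a sketch on made-up delays: the period labels are placeholders, and the 10-year threshold is an arbitrary choice for illustration.

```r
library(dplyr)
library(tibble)

# Toy per-reference delays (made-up values, placeholder period labels)
delays_toy <- tibble(
  from   = c("B&P", "B&P", "B&P", "B&P", "BT", "BT", "BT", "BT"),
  decade = c("period 1", "period 1", "period 2", "period 2",
             "period 1", "period 1", "period 2", "period 2"),
  delay  = c(5, 15, 8, 20, 4, 10, 6, 18)
)

# Median delay and share of references older than 10 years, per journal and period
delay_summary <- delays_toy |>
  group_by(from, decade) |>
  summarise(median_delay  = median(delay),
            share_over_10 = mean(delay > 10) * 100,
            .groups = "drop")

delay_summary
```

If the median moves little between periods while the share of old references grows, the shift in the cumulative curve comes mostly from the tail, that is, from older works that keep being cited.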